Improving the Performance of Loop-Based Programs Using a Prefetch Processor

Authors

  • Stephen P. Crago
  • Alvin M. Despain
Abstract

We present an architecture called the CAPP (Computing And Prefetching Processor). The CAPP provides high performance for loop-based scientific and signal processing programs by improving memory system performance with a decoupled prefetch processor. The prefetch processor relieves the main processor of prefetch instruction overhead and allows the prefetch distance to vary adaptively at run time. In this paper, we present the CAPP architecture, a sample program that shows how the architecture works, and simulation results for five Livermore Loops, discrete convolution, and one other benchmark. The simulation results show speedups of up to a factor of two to three for the CAPP compared to a uniprocessor with prefetching. The performance advantage of the CAPP architecture increases as the miss penalty grows relative to the processor cycle time, making it an attractive architecture as the gap between processor speed and DRAM speed continues to grow exponentially.
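To make the role of the prefetch distance concrete, the following is a minimal toy model, not the CAPP simulator and not taken from the paper. It assumes a loop that touches one new cache block per iteration, a fixed amount of work per iteration, and a fixed miss penalty. It ignores prefetch instruction overhead, cache capacity, and memory bandwidth. The function name and all parameters are hypothetical.

```python
def simulate_loop(n_iters, work_cycles, miss_penalty, prefetch_distance):
    """Return total cycles for a toy loop touching one new cache block per iteration.

    prefetch_distance is the number of iterations ahead that a block is requested.
    A distance of 0 models no prefetching: every iteration takes a demand miss.
    All names and the timing model are illustrative assumptions.
    """
    start_times = []  # cycle at which each iteration started executing
    t = 0
    for i in range(n_iters):
        if prefetch_distance == 0:
            issue = t  # demand miss: the fetch starts when the data is needed
        elif i < prefetch_distance:
            issue = 0  # the first blocks are requested before the loop starts
        else:
            # The block for iteration i was requested at the start of
            # iteration i - prefetch_distance.
            issue = start_times[i - prefetch_distance]
        arrival = issue + miss_penalty
        t = max(t, arrival)  # stall until the block has arrived
        start_times.append(t)
        t += work_cycles     # do the iteration's computation
    return t


if __name__ == "__main__":
    # Compare no prefetching, a too-short distance, and a sufficient distance.
    for distance in (0, 5, 10):
        cycles = simulate_loop(100, work_cycles=10, miss_penalty=100,
                               prefetch_distance=distance)
        print(distance, cycles)
```

In this model the loop stops stalling once the prefetch distance times the work per iteration covers the miss penalty. A distance fixed at compile time for one miss penalty becomes too short when the penalty grows, which is one way to read the abstract's motivation for letting the distance vary at run time.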

Similar resources

Tolerating Latency by Prefetching Java Objects

In recent years, processor speed has increased much faster than memory speed. One technique for improving memory performance is data prefetching, which is successful in array-based codes but which researchers are only now applying to pointer-based codes. In this paper, we evaluate a data prefetching technique, called greedy prefetching, for tolerating latency in Java programs. In greedy prefetchi...

Data Prefetching using a modified Markov Predictor

As processors are clocked faster and faster, the gap between processor and memory speeds grows exponentially. This difference in performance causes significant memory latencies for programs with a large number of pointer references. C++ and Java programs suffer performance hits because of this memory delay. Several methods have been developed to overcome this gap, including hardware and softwa...

Design and Implementation of Digital Demodulator for Frequency Modulated CW Radar (RESEARCH NOTE)

Radar signal processing has been an interesting area of research for the realization of programmable digital signal processors using VLSI design techniques. Digital Signal Processing (DSP) algorithms have been an integral design methodology for the implementation of high-speed, application-specific real-time systems, especially for high-resolution radar. The CORDIC algorithm, in recent times, has turned out to...

A Compiler-Assisted Data Prefetch Controller

Data-intensive applications often exhibit memory referencing patterns with little data reuse, resulting in poor cache utilization and run-times that can be dominated by memory delays. Data prefetching has been proposed as a means of hiding the memory access latencies of data referencing patterns that defeat caching strategies. Prefetching techniques that either use special cache logic to issue ...

Reducing the Traffic of Loop-Based Programs Using a Prefetch Processor

Large cache block sizes are used to take advantage of spatial locality and amortize long memory latency over more words. However, the cost of large cache block sizes is increased memory traffic requirements, especially for applications that show poor spatial locality. Software prefetching is usually presumed to increase memory traffic. We present an architecture that uses a separate processor d...

Publication date: 2007